To start working on the homework we have to download stocks data. For this we have first created the CSV file named as stocks.csv which simply contains the names of the comapnies for which we have to collect data. To do this we have also focused on collecting the 10 companies data in each category discussed in homework description. We have also noticed that instead of 10 categories there are 11 categories. Also for this part we have focused on collecting the data from January 1, 2003 through January 1, 2008.
This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).
Lets assume we have already all the packages. I will include them here.
require(tseries, warn.conflicts=F, quietly = TRUE)
require(igraph, warn.conflicts=F, quietly = TRUE)
require(lemon,warn.conflicts=F, quietly = TRUE)
require(GGally,warn.conflicts=F, quietly = TRUE)
require(energy,warn.conflicts=F, quietly = TRUE)
require(randomcoloR,warn.conflicts=F, quietly = TRUE)
After that we just read the csv file which have rows. We have showed the rows that it contains. We also have to use the lemon library to pretty print the dataframe values.
stock <- read.csv("stocks.csv")
knitr::kable(head(stock), format="markdown")
| Symbol | Security | GICSSector | GICS.Sub.Industry | Location | Date.first.added | CIK | Founded |
|---|---|---|---|---|---|---|---|
| A | Agilent Technologies Inc | Health Care | Health Care Equipment | Santa Clara, California | 6/5/2000 | 1090872 | 1999 |
| UNP | Union Pacific | Industrials | Railroads | Omaha, Nebraska | 100885 | ||
| AAP | Advance Auto Parts | Consumer Discretionary | Automotive Retail | Roanoke, Virginia | 7/9/2015 | 1158449 | 1932 |
| AAPL | Apple Inc. | Information Technology | Technology Hardware, Storage & Peripherals | Cupertino, California | 11/30/1982 | 320193 | 1977 |
| ABC | AmerisourceBergen Corp | Health Care | Health Care Distributors | Chesterbrook, Pennsylvania | 8/30/2001 | 1140859 | 1985 |
| ABMD | ABIOMED Inc | Health Care | Health Care Equipment | Danvers, Massachusetts | 5/31/2018 | 815094 | 1981 |
From above we can see that Symbol column contains the companies name and GICSSector contains the category. After that we have downloaded data from January 1, 2003 through January 1, 2008.
symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
if (is.null(data)) {
data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
}
else {
temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
data <- cbind(data, temp)
}
}
cMatrix <- as.matrix(data)
After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-01-03 | -0.0047133 | -0.0065585 | -0.0155700 | 0.0067342 | 0.0149835 | 0.0231076 | 0.0096908 | 0.0074946 | 0.0269766 | 0.0042381 | 0.0063694 | -0.0056783 | -0.0027816 | -0.0284840 | 0.0136249 | 0.0200253 | 0.0628009 | 0.0003165 | 0.0193509 | -0.0029895 |
| 2003-01-06 | 0.0466631 | 0.0179333 | -0.0659116 | 0.0000000 | 0.0118948 | 0.0275362 | 0.0064087 | 0.0184946 | 0.0449806 | 0.0596970 | 0.0015860 | 0.0027197 | -0.0168544 | 0.0284840 | 0.0418107 | 0.0590889 | 0.0000000 | 0.0125788 | 0.0129210 | 0.0330474 |
| 2003-01-07 | -0.0090589 | 0.0001616 | 0.0056436 | -0.0033619 | -0.0260110 | -0.0849932 | -0.0460053 | -0.0195618 | 0.0357053 | 0.0046974 | -0.0143658 | -0.0094270 | 0.0056497 | 0.0422456 | -0.0350673 | -0.0175298 | -0.0234615 | -0.0084733 | -0.0033841 | -0.0196643 |
| 2003-01-08 | -0.0497512 | -0.0130083 | -0.0062969 | -0.0204083 | -0.0372670 | -0.0300159 | 0.0248996 | -0.0069649 | -0.0496148 | -0.0266682 | -0.0072610 | -0.0085107 | -0.0028208 | -0.0105612 | 0.0018522 | -0.0060240 | -0.0059524 | 0.0022037 | -0.0164050 | -0.0125517 |
| 2003-01-09 | 0.0390776 | 0.0105821 | 0.0140589 | 0.0088943 | -0.0128590 | 0.2314195 | 0.0032570 | -0.0124426 | 0.0400993 | 0.0484199 | 0.0128723 | 0.0159604 | 0.0168071 | 0.0138388 | -0.0224571 | 0.0070246 | 0.0409414 | 0.0125002 | 0.0027529 | 0.0363914 |
| 2003-01-10 | 0.0167220 | -0.0035691 | -0.0032269 | 0.0027219 | -0.0113658 | -0.0968498 | -0.0050151 | -0.0822226 | 0.0144140 | 0.0205977 | -0.0088319 | -0.0106952 | 0.0456105 | -0.0245127 | -0.0066461 | -0.0537525 | -0.0379608 | -0.0125002 | -0.0093217 | -0.0095002 |
After that we plotted logMatrix which will be our required X matrix.
plot(logMatrix)
Above completes point 2 part of the homework in Your Job section.
Lets move to point 3.
For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.
pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1.0000000 | 0.3132293 | 0.2453006 | 0.2894047 | 0.2060193 | 0.1929294 | 0.2173697 | 0.2595466 | 0.3440663 | 0.4530926 | 0.1865611 | 0.3108620 | 0.2168602 | 0.2993274 | 0.2265353 | 0.2792155 | 0.2083448 | 0.2708431 | 0.2246741 | 0.3329415 |
| UNP | 0.3132293 | 1.0000000 | 0.2409231 | 0.2394450 | 0.2275416 | 0.1805622 | 0.2706303 | 0.2816294 | 0.2722011 | 0.2881027 | 0.2555853 | 0.3005328 | 0.2097893 | 0.2803825 | 0.2796656 | 0.2944920 | 0.2480392 | 0.2920236 | 0.1999380 | 0.3494798 |
| AAP | 0.2453006 | 0.2409231 | 1.0000000 | 0.1999078 | 0.1331749 | 0.1000599 | 0.1607918 | 0.1810424 | 0.1746663 | 0.2027267 | 0.1106188 | 0.1685443 | 0.1931692 | 0.1736631 | 0.1184338 | 0.1588764 | 0.1793692 | 0.2229857 | 0.1487453 | 0.2467757 |
| AAPL | 0.2894047 | 0.2394450 | 0.1999078 | 1.0000000 | 0.1270099 | 0.0927325 | 0.1895600 | 0.1360781 | 0.3016231 | 0.2960004 | 0.1349346 | 0.2564773 | 0.2002367 | 0.2784136 | 0.1937273 | 0.2194870 | 0.1332074 | 0.2081890 | 0.2015270 | 0.2546643 |
| ABC | 0.2060193 | 0.2275416 | 0.1331749 | 0.1270099 | 1.0000000 | 0.1144457 | 0.2634622 | 0.1995059 | 0.1942354 | 0.1777007 | 0.1762504 | 0.2228011 | 0.1627588 | 0.1479000 | 0.2147295 | 0.2136665 | 0.1685642 | 0.1420544 | 0.2784182 | 0.2002196 |
| ABMD | 0.1929294 | 0.1805622 | 0.1000599 | 0.0927325 | 0.1144457 | 1.0000000 | 0.1759887 | 0.1675586 | 0.1482111 | 0.1929230 | 0.0936166 | 0.1393736 | 0.1013807 | 0.1779248 | 0.1643561 | 0.2037472 | 0.1463400 | 0.1504244 | 0.1378275 | 0.1819907 |
From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.
Now lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.
#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
row_idx <- sample(1:nrow(logMatrix), replace = T)
bSample <- logMatrix[row_idx,] # bootstrap sample
rownames(bSample) <- rownames(logMatrix)
sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
sample_Cor[is.na(sample_Cor)] <- 0 # replacing values
btheta <- sqrt(nrow(logMatrix))*max(sample_Cor-R) # bootstrap replicate
brep[b] <- btheta # save
}
After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.
Gstar<- ecdf(brep)
plot(Gstar)
From above you can see that values graph changes between 0 to 1. Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.
confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
confidence
## [1] 0.1817225
From above you can notice that its value is 0.18. Lets move to building PR[j,k]∈Cj,k(α) for all (j,k) →n 1−α for this We needed to build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”
We wrote function which will create matrix for edges which is based on our confidence interval and error as ee = 0.1
matrix_apply <- function(m,ee,con) {
m2 <- m
for (r in seq(nrow(m2))){
for (c in seq(ncol(m2))){
l <- m[[r,c]]-con
h <- m[[r,c]]+con
eel <- ee*-1
eeh <- ee
if(h <= eel || l >= eeh){
m2[[r,c]]<-1
}
else{
m2[[r,c]]<-0
}
}
}
return(m2)
}
ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| UNP | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAP | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAPL | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ABC | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ABMD | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .
color_graph <- function(graph,stocks,sectors,colors){
for(i in 1:nrow(stocks)) {
row <- stocks[i,]
#assigning the colors
for(j in 1:length(sectors)) {
if(sectors[j] == row$GICSSector){
V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
break
}
}
}
return (graph)
}
Lets plot the graph after using above function for coloring.
ee <- 0.13
sectors <- unique(stock$GICSSector) # getting GICS
colors <- distinctColorPalette(length(sectors)) # getting colors based on GICS
m <- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors))
legend("bottomleft", legend= sectors , col = colors , bty = "n", pch=20 , pt.cex = 0.7, cex = 0.35, text.col=colors , horiz = FALSE, inset = c(0.1, 0.1))
From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim we have drawn the graph containing unique color for stocks that are in the same GICS
We will going to run this with different error values to see what will happen to our stocks when we increase error value ee
eeee<- 0.1
for( i in 1:10){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
title = paste("Error: ",ee," and Confidence Level: ",round(0.95,2),sep = "")
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main =title, line = 0.5, cex.main = 0.4)
}
From above you can see that our confidence value wa 0.18 and as we increase our error value from 0.10 to 0.37 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togather this proves our hypothesis.
Now Lets move to Point 4.
To start working on this we need to decrease number of stocks otherwise the result are not statisfactory and the plot we have is not clustored togather.
-> We have decrease the size of the stocks, considering only 4 stocks from each category.
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
stock <- read.csv("stocks_short.csv")
symbols <- stock[["Symbol"]]
data <- NULL
for (row in symbols) {
if (is.null(data)) {
data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
}
else {
temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
data <- cbind(data, temp)
}
}
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
plot(logMatrix)
Above steps are the same for creating our Matrix X. I have just changed the CSV file to have only 34 companies names instead of 110 as I have did before.
# created empty correlation matrix
cor_mat <- matrix(1:ncol(logMatrix)**2,nrow=ncol(logMatrix),dimnames=list(colnames(logMatrix),colnames(logMatrix)))
logMatrix[is.na(logMatrix)] <- 0 # na values
logMatrix[is.infinite(logMatrix)] <- 0 # removing infinite values.
alpha <- 0.05 # considering alpha directly instead of calculating it from confidence level
for(i in seq(nrow(cor_mat))){
for(j in seq(ncol(cor_mat))){
X <- logMatrix[,i]
Y <- logMatrix[,j]
dd <- dcov.test(X,Y,index=0.01,R=100) # this test will give us distance covariance value. This will do all the testing as behind the scene as described in class
cor_mat[i,j] <- dd$p.value
}
}
After above code will run we will have cor_mat which contains our desired values. But it takes around 2 hours to run. So we have already saved it to Q2Final.RData file which has already that matrix.
load("Q2Final.RData")
For plotting we need again the matrix_apply function for getting edges for following this condition on matrix each value “if and only if we reject the null hypothesis that γ2 = 0.”. So
matrix_apply_alpha <- function(m,alpha) {
m2 <- m
for (r in seq(nrow(m2))){
for (c in seq(ncol(m2))){
val <- m[[r,c]]
if(val>=alpha){
m2[[r,c]]<-1
}
else{
m2[[r,c]]<-0
}
}
}
return(m2)
}
color_graph <- function(graph,stocks,sectors,colors){
for(i in 1:nrow(stocks)) {
row <- stocks[i,]
#assigning the colors
for(j in 1:length(sectors)) {
if(sectors[j] == row$GICSSector){
V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
break
}
}
}
return (graph)
}
sectors <- unique(stock$GICSSector) # getting GICS
colors <- distinctColorPalette(length(sectors)) # getting colors based on GICS
m <- matrix_apply_alpha(cor_mat,alpha)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main ="Marginal Correlation Graph", line = 0.5, cex.main = 0.4)
Finally, we have all the things we need. Lets move to point 5.
Now, for this part we have focused on collecting the data from January 1, 2013 through January 1, 2018.
This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).
stock <- read.csv("stocks.csv")
symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
if (is.null(data)) {
data <- suppressWarnings(get.hist.quote(instrument = row, start = "2013-01-01", end = "2018-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
}
else {
temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2013-01-01", end = "2018-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
data <- cbind(data, temp)
}
}
cMatrix <- as.matrix(data)
After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-03 | 0.0035753 | 0.0014002 | 0.0000000 | -0.0127026 | -0.0020716 | -0.0176865 | 0.0373589 | -0.0036266 | -0.0155083 | -0.0162679 | -0.0080546 | 0.0039313 | 0.0080194 | -0.0155507 | -0.0041593 | -0.0006876 | -0.0054695 | -0.0256640 | 0.0201029 | -0.0082577 |
| 2013-01-04 | 0.0195554 | 0.0173395 | 0.0154682 | -0.0282499 | 0.0066597 | -0.0029784 | -0.0060296 | 0.0055073 | 0.0100159 | -0.0179471 | 0.0270567 | 0.0088316 | 0.0057079 | -0.0002749 | 0.0000000 | -0.0016061 | 0.0234880 | -0.0126028 | -0.0072421 | 0.0033112 |
| 2013-01-07 | -0.0072591 | -0.0047480 | -0.0034016 | -0.0058997 | 0.0031993 | -0.0120031 | 0.0081314 | -0.0043454 | -0.0049955 | 0.0030528 | -0.0422918 | -0.0038967 | 0.0216695 | -0.0074535 | -0.0125829 | -0.0041417 | -0.0317487 | -0.0038506 | 0.0042115 | -0.0102451 |
| 2013-01-08 | -0.0080227 | -0.0006912 | -0.0164908 | 0.0026878 | -0.0013699 | -0.0030234 | 0.0002998 | 0.0057896 | 0.0052576 | -0.0103702 | 0.0117126 | 0.0057549 | 0.0051677 | 0.0058019 | 0.0045352 | -0.0050855 | 0.0000000 | 0.0115076 | 0.0118372 | -0.0078234 |
| 2013-01-09 | 0.0266496 | 0.0054392 | 0.0030437 | -0.0157523 | 0.0000000 | -0.0284088 | 0.0065751 | 0.0070468 | 0.0135419 | -0.0026094 | 0.0049279 | -0.0025349 | 0.0023168 | 0.0101412 | 0.0045146 | 0.0002317 | 0.0045977 | 0.0011435 | 0.0089584 | 0.0030807 |
| 2013-01-10 | 0.0073547 | 0.0015268 | -0.0087409 | 0.0123198 | 0.0027378 | -0.0031201 | 0.0083061 | -0.0087802 | -0.0010352 | 0.0120413 | -0.0049279 | 0.0037156 | -0.0047681 | -0.0040989 | 0.0127879 | 0.0089955 | 0.0172810 | 0.0164356 | 0.0060416 | 0.0011180 |
After that we plotted logMatrix which will be our required X matrix.
plot(logMatrix)
Comparison: If we look at result of this matrix you can clearly see that it x-axis is between -0.05 to 0.05. In new matrix, we values are more desprerse on X-axis and previous data values are more despersed in Y-axis. but both of the graph are concenterated over 0.
Above completes point 2 part of the homework in Your Job section for data 20013 to 2018
Lets move to point 3.
For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.
pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | ADM | ADP | ADS | ADSK | AEE | AEP | AES | AFL | AGN | AIG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1.0000000 | 0.4265864 | 0.2084390 | 0.3028485 | 0.2831098 | 0.2293371 | 0.5175138 | 0.4815826 | 0.4311880 | 0.4734604 | 0.3342113 | 0.4471464 | 0.3796908 | 0.4548649 | 0.2030784 | 0.1914714 | 0.3294797 | 0.4427146 | 0.3577668 | 0.4766110 |
| UNP | 0.4265864 | 1.0000000 | 0.2594759 | 0.2688674 | 0.2413006 | 0.1741804 | 0.3541362 | 0.3972991 | 0.3231438 | 0.4270688 | 0.3507299 | 0.4396891 | 0.3737763 | 0.3100782 | 0.1960577 | 0.1839721 | 0.2933191 | 0.4613683 | 0.2361382 | 0.4359570 |
| AAP | 0.2084390 | 0.2594759 | 1.0000000 | 0.1072128 | 0.1539561 | 0.1245225 | 0.2647568 | 0.2037882 | 0.1600790 | 0.1842714 | 0.2246974 | 0.2555626 | 0.2386067 | 0.1284772 | 0.1418276 | 0.1369821 | 0.1965649 | 0.2605496 | 0.1791216 | 0.2446367 |
| AAPL | 0.3028485 | 0.2688674 | 0.1072128 | 1.0000000 | 0.1409962 | 0.1809675 | 0.2584548 | 0.2724068 | 0.2726530 | 0.3789044 | 0.1659758 | 0.2545354 | 0.2133226 | 0.2794214 | 0.1002925 | 0.1325608 | 0.1745371 | 0.2492528 | 0.1834857 | 0.2946559 |
| ABC | 0.2831098 | 0.2413006 | 0.1539561 | 0.1409962 | 1.0000000 | 0.1746143 | 0.3612115 | 0.2360283 | 0.2096561 | 0.2048848 | 0.1725127 | 0.3267634 | 0.2718505 | 0.2205016 | 0.1087532 | 0.1189980 | 0.1478877 | 0.2732205 | 0.3849018 | 0.3044349 |
| ABMD | 0.2293371 | 0.1741804 | 0.1245225 | 0.1809675 | 0.1746143 | 1.0000000 | 0.2632927 | 0.2081403 | 0.2409888 | 0.2689047 | 0.1479566 | 0.2459297 | 0.2099626 | 0.2502158 | 0.0931469 | 0.0770499 | 0.1131560 | 0.1776666 | 0.1499866 | 0.1855646 |
From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.
Now Comparison: If we compare this matrix values to previous ones we can say that values are almost in .1 difference. So one can conlude that some of the companies values decreased and for some it is increase. Just by looking at Pearson Correlation values for each stock.
Lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.
#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
row_idx <- sample(1:nrow(logMatrix), replace = T)
bSample <- logMatrix[row_idx,] # bootstrap sample
rownames(bSample) <- rownames(logMatrix)
sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
sample_Cor[is.na(sample_Cor)] <- 0 # replacing values
btheta <- sqrt(nrow(logMatrix))*max(sample_Cor-R) # bootstrap replicate
brep[b] <- btheta # save
}
After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.
Gstar<- ecdf(brep)
plot(Gstar)
From above you can see that values graph changes between 0 to 1. Comparison: Look like both of line graph is almost the same.
Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.
confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
confidence
## [1] 0.1838335
From above you can notice that its value is 0.18.
Comparison: Almost same value we compare both confidence values then 0.1827636 previous value and 0.1849653 is current one.
Lets move to building PR[j,k]∈Cj,k(α) for all (j,k) →n 1−α for this We needed to build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”
We wrote function which will create matrix for edges which is based on our confidence interval and error as ee = 0.1
matrix_apply <- function(m,ee,con) {
m2 <- m
for (r in seq(nrow(m2))){
for (c in seq(ncol(m2))){
l <- m[[r,c]]-con
h <- m[[r,c]]+con
eel <- ee*-1
eeh <- ee
if(h <= eel || l >= eeh){
m2[[r,c]]<-1
}
else{
m2[[r,c]]<-0
}
}
}
return(m2)
}
ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown")
| A | UNP | AAP | AAPL | ABC | ABMD | ABT | ACN | ADBE | ADI | |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| UNP | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| AAP | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAPL | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| ABC | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| ABMD | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Comparison: No significant difference found in values. Although we will can notice that few values for edges changed.
From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .
color_graph <- function(graph,stocks,sectors,colors){
for(i in 1:nrow(stocks)) {
row <- stocks[i,]
#assigning the colors
for(j in 1:length(sectors)) {
if(sectors[j] == row$GICSSector){
V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
break
}
}
}
return (graph)
}
Lets plot the graph after using above function for coloring.
ee <- 0.13
sectors <- unique(stock$GICSSector) # getting GICS
colors <- distinctColorPalette(length(sectors)) # getting colors based on GICS
m <- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors))
legend("bottomleft", legend= sectors , col = colors , bty = "n", pch=20 , pt.cex = 0.7, cex = 0.35, text.col=colors , horiz = FALSE, inset = c(0.1, 0.1))
From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim we have drawn the graph containing unique color for stocks that are in the same GICS
Comparison For previous years some companies are more despersed then current years from 2013 to 1018. You can clearly see onlt BBY and CM was out from the clustor but previously we had severals like CMG, AGN etc.
We will going to run this with different error values to see what will happen to our stocks when we increase error value ee
eeee<- 0.1
for( i in 1:10){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
title = paste("Error: ",ee," and Confidence Level: ",round(0.95,2),sep = "")
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main =title, line = 0.5, cex.main = 0.4)
}
From above you can see that our confidence value wa 0.18 and as we increase our error value from 0.10 to 0.37 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togather this proves our hypothesis.
Comparison: If we compare graphs and values initially for 2013 to 2018 stock values are more dependent and clustored togather than 2003 to 2008 stocks values.
Now Lets move to Point 4.
To start working on this we need to decrease number of stocks otherwise the result are not statisfactory and the plot we have is not clustored togather.
-> We have decrease the size of the stocks, considering only 4 stocks from each category.
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
stock <- read.csv("stocks_short.csv")
symbols <- stock[["Symbol"]]
data <- NULL
for (row in symbols) {
if (is.null(data)) {
data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
}
else {
temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
data <- cbind(data, temp)
}
}
cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
plot(logMatrix)
Above steps are the same for creating our Matrix X. I have just changed the CSV file to have only 34 companies names instead of 110 as I have did before.
# created empty correlation matrix
cor_mat <- matrix(1:ncol(logMatrix)**2,nrow=ncol(logMatrix),dimnames=list(colnames(logMatrix),colnames(logMatrix)))
logMatrix[is.na(logMatrix)] <- 0 # na values
logMatrix[is.infinite(logMatrix)] <- 0 # removing infinite values.
alpha <- 0.05 # considering alpha directly instead of calculating it from confidence level
for(i in seq(nrow(cor_mat))){
for(j in seq(ncol(cor_mat))){
X <- logMatrix[,i]
Y <- logMatrix[,j]
dd <- dcov.test(X,Y,index=0.01,R=100) # this test will give us distance covariance value. This will do all the testing as behind the scene as described in class
cor_mat[i,j] <- dd$p.value
}
}
After above code will run we will have cor_mat which contains our desired values. But it takes around 2 hours to run. So we have already saved it to Q2Final.RData file which has already that matrix.
load("Q2Final.RData")
For plotting we need again the matrix_apply function for getting edges for following this condition on matrix each value “if and only if we reject the null hypothesis that γ2 = 0.”. So
matrix_apply_alpha <- function(m,alpha) {
m2 <- m
for (r in seq(nrow(m2))){
for (c in seq(ncol(m2))){
val <- m[[r,c]]
if(val>=alpha){
m2[[r,c]]<-1
}
else{
m2[[r,c]]<-0
}
}
}
return(m2)
}
color_graph <- function(graph,stocks,sectors,colors){
for(i in 1:nrow(stocks)) {
row <- stocks[i,]
#assigning the colors
for(j in 1:length(sectors)) {
if(sectors[j] == row$GICSSector){
V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
break
}
}
}
return (graph)
}
sectors <- unique(stock$GICSSector) # getting GICS
colors <- distinctColorPalette(length(sectors)) # getting colors based on GICS
m <- matrix_apply_alpha(cor_mat,alpha)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock,sectors,colors) ,main="")
title(main ="Marginal Correlation Graph", line = 0.5, cex.main = 0.4)
Finally, we have all the things we need. Lets move to point 5.